Keyword Search Result

[Keyword] parallel process(118hit)

21-40hit(118hit)

  • Optimized Implementation of Pedestrian Tracking Using Multiple Cues on GPU

    Ryusuke MIYAMOTO  Hiroki SUGANO  

     
    PAPER-Image Processing

      Vol:
    E94-A No:11
      Page(s):
    2323-2333

    Nowadays, pedestrian recognition for automotive and security applications that require accurate recognition in images taken from distant observation points is a recent challenging problem in the field of computer vision. To achieve accurate recognition, both detection and tracking must be precise. For detection, some excellent schemes suitable for pedestrian recognition from distant observation points are proposed, however, no tracking schemes can achieve sufficient performance. To construct an accurate tracking scheme suitable for pedestrian recognition from distant observation points, we propose a novel pedestrian tracking scheme using multiple cues: HSV histograms and HOG features. Experimental results show that the proposed scheme can properly track a target pedestrian where tracking schemes using only a single cue fails. Moreover, we implement the proposed scheme on NVIDIA® TeslaTM C1060 processor, one of the latest GPU, to achieve real-time processing of the proposed scheme. Experimental results show that computation time required for tracking of a frame by our implementation is reduced to 8.80 ms even though Intel® CoreTM i7 CPU 975 @ 3.33 GHz spends 111 ms.

  • High-Speed FPGA Implementation of the SHA-1 Hash Function

    Je-Hoon LEE  Sang-Choon KIM  Young-Jun SONG  

     
    LETTER-Cryptography and Information Security

      Vol:
    E94-A No:9
      Page(s):
    1873-1876

    This paper presents a high-speed SHA-1 implementation. Unlike the conventional unfolding transformation, the proposed unfolding transformation technique makes the combined hash operation blocks to have almost the same delay overhead regardless of the unfolding factor. It can achieve high throughput of SHA-1 implementation by avoiding the performance degradation caused by the first hash computation. We demonstrate the proposed SHA-1 architecture on a FPGA chip. From the experimental results, the SHA-1 architecture with unfolding factor 5 shows 1.17 Gbps. The proposed SHA-1 architecture can achieve about 31% performance improvements compared to its counterparts. Thus, the proposed SHA-1 can be applicable for the security of the high-speed but compact mobile appliances.

  • A Fast Divide-and-Conquer Algorithm for Indexing Human Genome Sequences

    Woong-Kee LOH  Yang-Sae MOON  Wookey LEE  

     
    PAPER-Fundamentals of Information Systems

      Vol:
    E94-D No:7
      Page(s):
    1369-1377

    Since the release of human genome sequences, one of the most important research issues is about indexing the genome sequences, and the suffix tree is most widely adopted for that purpose. The traditional suffix tree construction algorithms suffer from severe performance degradation due to the memory bottleneck problem. The recent disk-based algorithms also provide limited performance improvement due to random disk accesses. Moreover, they do not fully utilize the recent CPUs with multiple cores. In this paper, we propose a fast algorithm based on `divide-and-conquer' strategy for indexing the human genome sequences. Our algorithm nearly eliminates random disk accesses by accessing the disk in the unit of contiguous chunks. In addition, our algorithm fully utilizes the multi-core CPUs by dividing the genome sequences into multiple partitions and then assigning each partition to a different core for parallel processing. Experimental results show that our algorithm outperforms the previous fastest DIGEST algorithm by up to 10.5 times.

  • NUFFT- & GPU-Based Fast Imaging of Vegetation

    Amedeo CAPOZZOLI  Claudio CURCIO  Antonio DI VICO  Angelo LISENO  

     
    PAPER-Sensing

      Vol:
    E94-B No:7
      Page(s):
    2092-2103

    We develop an effective algorithm, based on the filtered backprojection (FBP) approach, for the imaging of vegetation. Under the FBP scheme, the reconstruction amounts at a non-trivial Fourier inversion, since the data are Fourier samples arranged on a non-Cartesian grid. The computational issue is efficiently tackled by Non-Uniform Fast Fourier Transforms (NUFFTs), whose complexity grows asymptotically as that of a standard FFT. Furthermore, significant speed-ups, as compared to fast CPU implementations, are obtained by a parallel versions of the NUFFT algorithm, purposely designed to be run on Graphic Processing Units (GPUs) by using the CUDA language. The performance of the parallel algorithm has been assessed in comparison to a CPU-multicore accelerated, Matlab implementation of the same routine, to other CPU-multicore accelerated implementations based on standard FFT and employing linear, cubic, spline and sinc interpolations and to a different, parallel algorithm exploiting a parallel linear interpolation stage. The proposed approach has resulted the most computationally convenient. Furthermore, an indoor, polarimetric experimental setup is developed, capable to isolate and introduce, one at a time, different non-idealities of a real acquisition, as the sources (wind, rain) of temporal decorrelation. Experimental far-field polarimetric measurements on a thuja plicata (western redcedar) tree point out the performance of the set up algorithm, its robustness against data truncation and temporal decorrelation as well as the possibility of discriminating scatterers with different features within the investigated scene.

  • Energy-Saving Stochastic Scheduling of a Real-Time Parallel Task with Varying Computation Amount on Multi-Core Processors

    Wan Yeon LEE  Kyong Hoon KIM  

     
    LETTER-Systems and Control

      Vol:
    E94-A No:2
      Page(s):
    842-845

    The proposed scheduling scheme minimizes the mean energy consumption of a real-time parallel task, where the task has the probabilistic computation amount and can be executed concurrently on multiple cores. The scheme determines a pertinent number of cores allocated to the task execution and the instant frequency supplied to the allocated cores. Evaluation shows that the scheme saves manifest amount of the energy consumed by the previous method minimizing the mean energy consumption on a single core.

  • High-Speed Computation of the Kleene Star in Max-Plus Algebraic System Using a Cell Broadband Engine

    Hiroyuki GOTO  

     
    PAPER-Fundamentals of Information Systems

      Vol:
    E93-D No:7
      Page(s):
    1798-1806

    This research addresses a high-speed computation method for the Kleene star of the weighted adjacency matrix in a max-plus algebraic system. We focus on systems whose precedence constraints are represented by a directed acyclic graph and implement it on a Cell Broadband EngineTM (CBE) processor. Since the resulting matrix gives the longest travel times between two adjacent nodes, it is often utilized in scheduling problem solvers for a class of discrete event systems. This research, in particular, attempts to achieve a speedup by using two approaches: parallelization and SIMDization (Single Instruction, Multiple Data), both of which can be accomplished by a CBE processor. The former refers to a parallel computation using multiple cores, while the latter is a method whereby multiple elements are computed by a single instruction. Using the implementation on a Sony PlayStation 3TM equipped with a CBE processor, we found that the SIMDization is effective regardless of the system's size and the number of processor cores used. We also found that the scalability of using multiple cores is remarkable especially for systems with a large number of nodes. In a numerical experiment where the number of nodes is 2000, we achieved a speedup of 20 times compared with the method without the above techniques.

  • An Optimization System with Parallel Processing for Reducing Common-Mode Current on Electronic Control Unit

    Yuji OKAZAKI  Takanori UNO  Hideki ASAI  

     
    PAPER

      Vol:
    E93-C No:6
      Page(s):
    827-834

    In this paper, we propose an optimization system with parallel processing for reducing electromagnetic interference (EMI) on electronic control unit (ECU). We adopt simulated annealing (SA), genetic algorithm (GA) and taboo search (TS) to seek optimal solutions, and a Spice-like circuit simulator to analyze common-mode current. Therefore, the proposed system can determine the adequate combinations of the parasitic inductance and capacitance values on printed circuit board (PCB) efficiently and practically, to reduce EMI caused by the common-mode current. Finally, we apply the proposed system to an example circuit to verify the validity and efficiency of the system.

  • Column-Parallel Vision Chip Architecture for High-Resolution Line-of-Sight Detection Including Saccade

    Junichi AKITA  Hiroaki TAKAGI  Keisuke DOUMAE  Akio KITAGAWA  Masashi TODA  Takeshi NAGASAKI  Toshio KAWASHIMA  

     
    PAPER-Image Sensor/Vision Chip

      Vol:
    E90-C No:10
      Page(s):
    1869-1875

    Although the line-of-sight (LoS) is expected to be useful as input methodology for computer systems, the application area of the conventional LoS detection system composed of video camera and image processor is restricted in the specialized area, such as academic research, due to its large size and high cost. There is a rapid eye motion, so called 'saccade' in our eye motion, which is expected to be useful for various applications. Because of the saccade's very high speed, it is impossible to track the saccade without using high speed camera. The authors have been proposing the high speed vision chip for LoS detection including saccade based on the pixel parallel processing architecture, however, its resolution is very low for the large size of its pixel. In this paper, we propose and discuss an architecture of the vision chip for LoS detection including saccade based on column-parallel processing manner for increasing the resolution with keeping high processing speed.

  • Media Processing LSI Architectures for Automotives -- Challenges and Future Trends --

    Ichiro KURODA  Shorin KYO  

     
    INVITED PAPER

      Vol:
    E90-C No:10
      Page(s):
    1850-1857

    This paper presents media processor architectures for automotive applications. Media processing applications with their requirements for LSI implementations are first described for vision based driver assistance as well as graphical user interface for car navigation using 3D graphics. Then, parallel processing architectures for vision and graphics in these applications are reviewed with their performance and cost. After that, future trends of automotive media processing such as integration of vision and 3D graphics functions are shown with their applications and the required performance. Moreover, parallel processing architectures are discussed for the integration of vision and graphics. Finally, an prospect of a next-generation media processing LSI for automotives is provided.

  • Hamiltonian Cycles and Hamiltonian Paths in Faulty Burnt Pancake Graphs

    Keiichi KANEKO  

     
    PAPER-Algorithm Theory

      Vol:
    E90-D No:4
      Page(s):
    716-721

    Recently, research on parallel processing systems is very active, and many complex topologies have been proposed. A burnt pancake graph is one such topology. In this paper, we prove that a faulty burnt pancake graph with degree n has a fault-free Hamiltonian cycle if the number of the faulty elements is n-2 or less, and it has a fault-free Hamiltonian path between any pair of nonfaulty nodes if the number of the faulty elements is n-3 or less.

  • Real-Time Huffman Encoder with Pipelined CAM-Based Data Path and Code-Word-Table Optimizer

    Takeshi KUMAKI  Yasuto KURODA  Masakatsu ISHIZAKI  Tetsushi KOIDE  Hans Jurgen MATTAUSCH  Hideyuki NODA  Katsumi DOSAKA  Kazutami ARIMOTO  Kazunori SAITO  

     
    PAPER-Image Processing and Video Processing

      Vol:
    E90-D No:1
      Page(s):
    334-345

    This paper presents a novel optimized real-time Huffman encoder using a pipelined data path based on CAM technology and a parallel code-word-table optimizer. The exploitation of CAM technology enables fast parallel search of the code word table. At the same time, the code word table is optimized according to the frequency of received input symbols and is up-dated in real-time. Since these two functions work in parallel, the proposed architecture realizes fast parallel encoding and keeps a constantly high compression ratio. Evaluation results for the JPEG application show that the proposed architecture can achieve up to 28% smaller encoded picture sizes than the conventional architectures. The obtained encoding time can be reduced by 95% in comparison to a conventional SRAM-based architecture, which is suitable even for the latest end-user-devices requiring fast frame-rates. Furthermore, the proposed architecture provides the only encoder that can simultaneously realize small compressed data size and fast processing speed.

  • Scalable FPGA/ASIC Implementation Architecture for Parallel Table-Lookup-Coding Using Multi-Ported Content Addressable Memory

    Takeshi KUMAKI  Yutaka KONO  Masakatsu ISHIZAKI  Tetsushi KOIDE  Hans Jurgen MATTAUSCH  

     
    PAPER-Image Processing and Video Processing

      Vol:
    E90-D No:1
      Page(s):
    346-354

    This paper presents a scalable FPGA/ASIC implementation architecture for high-speed parallel table-lookup-coding using multi-ported content addressable memory, aiming at facilitating effective table-lookup-coding solutions. The multi-ported CAM adopts a Flexible Multi-ported Content Addressable Memory (FMCAM) technology, which represents an effective parallel processing architecture and was previously reported in [1]. To achieve a high-speed parallel table-lookup-coding solution, FMCAM is improved by additional schemes for a single search mode and counting value setting mode, so that it permits fast parallel table-lookup-coding operations. Evaluation results for Huffman encoding within the JPEG application show that a synthesized semi-custom ASIC implementation of the proposed architecture can already reduce the required clock-cycle number by 93% in comparison to a conventional DSP. Furthermore, the performance per area unit, measured in MOPS/mm2, can be improved by a factor of 3.8 in comparison to parallel operated DSPs. Consequently, the proposed architecture is very suitable for FPGA/ASIC implementation, and is a promising solution for small area integrated realization of real-time table-lookup-coding applications.

  • Design and Evaluation of a Massively Parallel Processor Based on Matrix Architecture

    Toru SHIMIZU  Masami NAKAJIMA  Masahiro KAINAGA  

     
    INVITED PAPER

      Vol:
    E89-C No:11
      Page(s):
    1512-1518

    This paper describes the design and evaluation of a massively parallel processor base on Matrix architecture which is suitable for portable multimedia applications. The proposed architecture in this paper achieves 40 GOPS of 16-bit fixed-point additions at 200 MHz clock frequency and 250 mW power dissipation. In addition, 1 M-bit SRAM for data registers and 2,048 2-bit processing elements connected by a flexible switching network are integrated in 3.1 mm2 in 90 nm low-power CMOS technology. The energy-efficient Matrix architecture supports 2,048-way parallel operations and the programmable functions required for multimedia SoCs.

  • A Power- and Area-Efficient SRAM Core Architecture with Segmentation-Free and Horizontal/Vertical Accessibility for Super-Parallel Video Processing

    Junichi MIYAKOSHI  Yuichiro MURACHI  Tomokazu ISHIHARA  Hiroshi KAWAGUCHI  Masahiko YOSHIMOTO  

     
    PAPER

      Vol:
    E89-C No:11
      Page(s):
    1629-1636

    For super-parallel video processing, we proposed a power- and area-efficient SRAM core architecture with a segmentation-free access, which means accessibility to arbitrary consecutive pixels, and horizontal/vertical access. To achieve these flexible accesses, a spirally-connected local-wordline select signal and multi-selection scheme in wordlines are proposed, so that extra X-decoders in the conventional multi-division SRAM can be eliminated. Consequently, the proposed SRAM reduces a power and area by 57-60% and 60%, respectively, when it is applied to a 128 parallel architecture. The proposed 160-kbit SRAM with 16-read ports (2-read port SRAM with eight-parallel architecture) is implemented to a search window buffer for an H.264 motion estimation processor core which dissipates 800 µW for QCIF 15-fps in a 130-nm technology.

  • Vision Chip Architecture for Detecting Line of Sight Including Saccade

    Junichi AKITA  Hiroaki TAKAGI  Takeshi NAGASAKI  Masashi TODA  Toshio KAWASHIMA  Akio KITAGAWA  

     
    PAPER

      Vol:
    E89-C No:11
      Page(s):
    1605-1611

    Rapid eye motion, or so called saccade, is a very quick eye motion which always occurs regardless of our intention. Although the line of sight (LOS) with saccade tracking is expected to be used for a new type of computer-human interface, it is impossible to track it using the conventional video camera, because of its speed which is often up to 600 degrees per second. Vision Chip is an intelligent image sensor which has the photo receptor and the image processing circuitry on a single chip, which can process the acquired image information by keeping its spatial parallelism. It has also the ability of implementing the very compact integrated vision system. In this paper, we describe the vision chip architecture which has the capability of detecting the line of sight from infrared eye image, with the processing speed supporting the saccade tracking. The vision chip described here has the pixel parallel processing architecture, with the node automata for each pixel as image processing. The acquired image is digitized to two flags indicating the Purkinje's image and the pupil by comparators at first. The digitized images are then shrunk, followed by several steps of expanding by node automata located at each pixel. The shrinking process is kept executed until all the pixels disappear, and the pixel disappearing at last indicates the center of the Purkinje's image and the pupil. This disappearing step is detected by the projection circuitry in pixel circuit for fast operation, and the coordinates of the center of the Purkinje's image and the pupil are generated by the simple encoders. We describe the whole architecture of this vision chip, as well as the pixel architecture. We also describe the evaluation of proposed algorithm with numerical simulation, as well as processing speed using FPGA, and improvement in resolution using column parallel architecture.

  • Boundary-Active-Only Adaptive Power-Reduction Scheme for Region-Growing Video-Segmentation

    Takashi MORIMOTO  Hidekazu ADACHI  Osamu KIRIYAMA  Tetsushi KOIDE  Hans Jurgen MATTAUSCH  

     
    LETTER-Image Processing and Video Processing

      Vol:
    E89-D No:3
      Page(s):
    1299-1302

    This letter presents a boundary-active-only (BAO) power reduction technique for cell-network-based region-growing video segmentation. The key approach is an adaptive situation-dependent power switching of each network cell, namely only cells at the boundary of currently grown regions are activated, and all the other cells are kept in low-power stand-by mode. The effectiveness of the proposed technique is experimentally confirmed with CMOS test-chips having small-scale cell networks of up to 4133 cells, where an average of only 1.7% of the cells remains active after application of the proposed approach. About 85% power reduction is thus achievable without sacrificing real-time processing.

  • A Coarse-Grain Hierarchical Technique for 2-Dimensional FFT on Configurable Parallel Computers

    Xizhen XU  Sotirios G. ZIAVRAS  

     
    PAPER-Parallel/Distributed Algorithms

      Vol:
    E89-D No:2
      Page(s):
    639-646

    FPGAs (Field-Programmable Gate Arrays) have been widely used as coprocessors to boost the performance of data-intensive applications [1],[2]. However, there are several challenges to further boost FPGA performance: the communication overhead between the host workstation and the FPGAs can be substantial; large-scale applications cannot fit in a single FPGA because of its limited capacity; mapping an application algorithm to FPGAs still remains a daunting job in configurable system design. To circumvent these problems, we propose in this paper the FPGA-based Hierarchical-SIMD (H-SIMD) machine with its codesign of the Pyramidal Instruction Set Architecture (PISA). PISA comprises high-level instructions implemented as FPGA functions of coarse-grain SIMD (Single-Instruction, Multiple-Data) tasks to facilitate ease of program development, code portability across different H-SIMD implementations and high performance. We assume a multi-FPGA board where each FPGA is configured as a separate SIMD machine. Multiple FPGA chips can work in unison at a higher SIMD level, if needed, controlled by the host. Additionally, by using a memory switching scheme and the high-level PISA to partition applications into coarse-grain tasks, host-FPGA communication overheads can be hidden. We enlist the two-dimensional Fast Fourier Transform (2D FFT) to test the effectiveness of H-SIMD. The test results show sustained high performance for this problem. The H-SIMD machine even outperforms a Xeon processor for this problem.

  • Entropy Based Evaluation of Communication Predictability in Parallel Applications

    Alex K. JONES  Jiang ZHENG  Ahmed AMER  

     
    PAPER-Performance Evaluation

      Vol:
    E89-D No:2
      Page(s):
    469-478

    The performance of parallel computing applications is highly dependent on the efficiency of the underlying communication operations. While often characterized as dynamic, these communication operations frequently exhibit spatial and temporal locality as well as regularity in structure. These characteristics can be exploited to improve communication performance if the correct prediction model is selected to a suitable communication topology. In this paper we describe an entropy based methodology for quantifying and evaluating the success of different prediction models on actual workloads drawn from representative parallel benchmarks. We evaluate two different prediction criteria and combinations thereof: (1) Messages are partitioned by source node. (2) Use of a first order context model. We also describe the threshold for predication designed to largely avoid incorrect predication overheads. Our results show for simple predication models, even on highly dynamic benchmark applications, predictability can be improved by several orders of magnitude. In fact, using simple prediction techniques, over 75% of the communication volume is accurately predictable.

  • A Resource-Shared VLIW Processor for Low-Power On-Chip Multiprocessing in the Nanometer Era

    Kazutoshi KOBAYASHI  Masao ARAMOTO  Hidetoshi ONODERA  

     
    PAPER-Digital

      Vol:
    E88-C No:4
      Page(s):
    552-558

    We propose a low-power resource-shared VLIW processor (RSVP) for future leaky nanometer process technologies. It consists of several single-way independent processor units (IPUs) that share parallel processor resources. Each IPU works as a variable-way VLIW processor sharing the parallel resources according to priorities of given tasks. RSVP allocates shared parallel resources to the IPUs cycle by cycle. It can minimize the number of NOPs that is wasting power. The performance per power (P3) of a 4-parallel 4-way RSVP that corresponds to four 4way VLIWs is 3.7% better than a conventional 4-parallel 4-way VLIW multiprocessor in the current 90 nm process. We estimate that the RSVP achieves 36% less leakage power and 28% better P3 in the future 25 nm process. We have fabricated an RSVP test chip that contains two IPU and a shared resource equivalent to two 2way VLIWs in a 180 nm process. It is functional at 100 MHz clock speed and its power is 130 mW.

  • An MAMS-PP4: Multi-Access Memory System Used to Improve the Processing Speed of Visual Media Applications in a Parallel Processing System

    Hyung LEE  Hyeon-Koo CHO  Dae-Sang YOU  Jong-Won PARK  

     
    PAPER-Concurrent Systems

      Vol:
    E87-A No:11
      Page(s):
    2852-2858

    To fulfill the computing demands in visual media processing, we have been investigating a parallel processing system to improve the processing speed of the visual media related to applications from the point of view of a memory system within a single instruction multiple data (SIMD) computer. In this paper, we have introduced MAMS-PP4, which is similar to a pipelined SIMD architecture type and consists of pq processing elements (PEs) as well as a multi-access memory system (MAMS). MAMS supports simultaneous access to pq data elements within a horizontal (1 pq), a vertical (pq 1) or a block (p q) subarray with a constant interval in an arbitrary position in an M N array of data elements, where the number of memory modules, m, is a prime number greater than pq. MAMS reduces the memory access time for an SIMD computer and also improves the cost and complexity that involved in controlling the large volume of data demanded in visual media applications. PE is designed to be a two-state machine in order to utilize MAMS efficiently. MAMS-PP4 was fabricated into ASIC using TOSHIBA TC240C series library and a test board was used to measure the performance of ASIC. The test board consists of devices such as an MPC860 embedded-PCI board, two ASICs and a FPGA for the control units. Experiment was done on various computer systems in order to compare the performance of MAMS-PP4 using morphological operations as the application. MAMS-PP4 shows a respectful and consistent processing speed.

21-40hit(118hit)

FlyerIEICE has prepared a flyer regarding multilingual services. Please use the one in your native language.